JMIR Medical Informatics
JMIR Publications Inc.
Preprints posted in the last 90 days, ranked by how well they match JMIR Medical Informatics's content profile, based on 17 papers previously published here. The average preprint has a 0.05% match score for this journal, so anything above that is already an above-average fit.
Boyne, A.; Feygin, M.; Sholeen, J.; Zimolzak, A.
Background: Length of stay (LOS) is a critical metric for hospital operational efficiency. While structured clinical data are widely used to predict LOS, unstructured admission notes contain latent prognostic information regarding diagnostic uncertainty and disease complexity. This study evaluates the efficacy of extracting sentiment and direct LOS estimates from admission notes to predict patient hospitalization duration. Methods: We conducted a retrospective study of 4,503 adult patients admitted with community-acquired pneumonia between 2013 and 2023. Admission history and physical notes were preprocessed and filtered to extract physician-generated narratives. We evaluated four natural language processing models, VADER, TextBlob, Longformer, and an open-source large language model (GPT-oss-20B), to generate zero-shot sentiment scores. Additionally, GPT-oss-20B was prompted to directly estimate LOS. Model outputs were correlated with actual LOS using linear regression and Pearson correlation coefficients. Results: Sentiment models demonstrated statistically significant, albeit weak, correlations with actual LOS. Longformer achieved the highest variance explained among sentiment classifiers (R² = 0.019). Direct LOS estimation by the LLM outperformed sentiment-based approaches, demonstrating the strongest correlation with actual hospital duration (r = -0.218, p < 0.001). Model agreement was generally poor (ICC = 0.059), and computational time varied drastically, from 2.6 seconds per 100 notes (TextBlob) to over 370 seconds (GPT-oss-20B). Conclusion: Zero-shot sentiment analysis of clinical notes yields a small but measurable correlation with LOS, limited primarily by the objective, non-evaluative nature of clinical documentation. Direct LLM estimation of clinical outcomes outperforms emotional sentiment extraction. 
Future predictive systems should integrate computationally efficient NLP models capable of capturing latent clinical complexity alongside established structured data variables.
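To put the two reported statistics on the same scale, recall that in a simple one-predictor linear regression the variance explained equals the square of the Pearson coefficient. The sketch below is illustrative only and is not the authors' code; the function name `pearson_r` is an assumption made here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# For a single predictor, R^2 = r^2. Squaring the reported r = -0.218
# gives the variance explained by the direct LLM estimate.
r_squared_llm = (-0.218) ** 2
```

Under that identity, the direct LLM estimate explains roughly 4.8% of the variance in LOS, against 1.9% for the best sentiment model.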
Zhang, Z.; Wei, J.; Xu, J.; Li, Y.; Luk, A.; Bhalla, S.; Cui, H.; Clifton, D. A.; Walker, A. S.; Eyre, D. W.
Timely identification of hospital inpatients at risk of deterioration facilitates interventions to support their recovery. Many hospitals implement early warning scores, such as the National Early Warning Score 2 (NEWS2), to detect abnormal patient vital signs. However, these are typically based on a snapshot of the most recent vital signs, rather than exploiting trends over time that clinical intuition suggests may also be informative. Multiple approaches, including recently described methods, have been developed to predict patient deterioration from time series. We therefore compared the effectiveness of different mortality prediction models, including clinical scoring systems, classical machine learning models, and state-of-the-art deep learning models, using both snapshot and time series vital sign data. No significant improvement in model performance was observed using predictions from time series compared to using the last observation of the time series and non-temporal features such as demographics. Our study comprehensively compares different model types, and provides recommendations for developing predictive models and guidance on what evaluation is needed before considering deploying such models in inpatient care.
Patel, K.; Beedala, P.
Background: Machine learning models for intensive care unit (ICU) mortality prediction achieve strong internal discrimination yet rarely undergo external validation with calibration assessment, a gap that undermines clinical deployment. Calibration, the agreement between predicted probabilities and observed event rates, is a prerequisite for threshold-based decisions yet remains underreported. Methods: We conducted a retrospective cohort study using MIMIC-IV (v2.2; n = 52,028 ICU stays) for model development and eICU (n = 114,060) for independent external validation. Logistic regression, random forest, and gradient boosting (XGBoost) were evaluated on first-24-hour clinical variables. Discrimination was assessed via receiver operating characteristic area (AUROC) and precision-recall area (AUPRC); calibration via slope, intercept, and expected calibration error (ECE). Post-hoc logistic recalibration was applied externally. Clinical utility was evaluated by decision curve analysis benchmarked against Acute Physiology and Chronic Health Evaluation (APACHE) scores. Subgroup analyses examined sex and race/ethnicity; SHapley Additive exPlanations (SHAP) assessed feature importance. Uncertainty was estimated via bootstrap resampling; the study adheres to TRIPOD guidelines. Results: The recalibrated XGBoost model achieved internal AUROC 0.847 (95% CI: 0.832-0.860) and external AUROC 0.819 (95% CI: 0.815-0.823). Internal calibration was near-ideal (slope 0.982; intercept 0.001), whereas external validation revealed systematic risk overestimation (intercept -0.678) attributable to prevalence-driven label shift. An intercept-only adjustment reduced ECE by 26%. The model outperformed APACHE (AUROC 0.817 vs. 0.795; p < 0.001). Conclusions: ICU mortality models exhibit transportable discrimination but clinically significant calibration drift under cross-institutional deployment. 
Calibration evaluation and targeted recalibration should be mandatory in any clinical machine learning validation framework.
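The calibration terms in this abstract can be made concrete with a minimal sketch. It assumes equal-width probability bins for ECE (the abstract does not state the binning scheme) and solves for an intercept-only shift on the logit scale by bisection. The function names are hypothetical and this is not the authors' implementation.

```python
import math

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width bins: weighted mean |event rate - mean prediction|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for b in bins:
        if b:
            mean_pred = sum(p for p, _ in b) / len(b)
            event_rate = sum(y for _, y in b) / len(b)
            ece += len(b) / len(probs) * abs(event_rate - mean_pred)
    return ece

def intercept_shift(probs, labels, lo=-10.0, hi=10.0, iters=60):
    """Logit-scale shift that makes the mean prediction match the event rate.

    This is the maximum-likelihood intercept when the slope is fixed at 1.
    Probabilities must lie strictly between 0 and 1.
    """
    logits = [math.log(p / (1 - p)) for p in probs]
    target = sum(labels)
    for _ in range(iters):
        mid = (lo + hi) / 2
        expected = sum(1 / (1 + math.exp(-(z + mid))) for z in logits)
        if expected > target:
            hi = mid  # predictions still too high: shift further down
        else:
            lo = mid
    return (lo + hi) / 2

def apply_shift(probs, delta):
    """Apply a logit-scale intercept shift to each probability."""
    return [1 / (1 + math.exp(-(math.log(p / (1 - p)) + delta))) for p in probs]
```

A negative shift corresponds to the situation the abstract describes: the model overestimates risk in the external cohort, so every predicted probability is moved down by the same amount on the logit scale while the ranking of patients, and therefore AUROC, is unchanged.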
Naderalvojoud, B.; Sutjiadi, B. J.; Koul, A.; Curtin, C.; Gevaert, O.; Hernandez-Boussard, T.
Background Machine learning (ML) models are increasingly used to predict adverse outcomes after surgery. However, most rely on static patient characteristics (e.g., age, comorbidities) and overlook clinician-controlled treatment decisions that can be actively modified at the point of care. Discharge opioid prescribing is a key modifiable, clinician-controlled decision, yet optimizing prescribing choices across multiple adverse outcomes remains underexplored in predictive modeling. This study addresses that gap by introducing a novel ML framework that explicitly separates fixed patient risk factors from modifiable prescribing options to support personalized, risk-informed opioid prescribing decisions. Methods We developed the Hierarchical Clinical Fusion Transformer (HCF-Transformer), an ML model designed to estimate patient-specific risks across four postoperative outcomes: prolonged opioid use (POU), chronic pain (CP), 30-day readmission, and opioid-associated outcomes (OAO). The model constructs patient risk profiles from fixed, non-modifiable baseline factors, followed by a transformer layer. Clinician-controllable discharge opioid regimens are modeled as alternative intervention candidates and fused with the fixed risk representation through a clinical fusion mechanism, enabling assessment and ranking based on predicted risks. A Total Relative Risk (TRR) metric, calibrated to each outcome prediction threshold, guides the recommendation process. We evaluated the model in diabetic surgical patients, a common high-risk population. Results The study included 157,853 unique diabetic surgical patients, with outcome prevalences ranging from 47.2% (POU) to 1.8% (OAO). The HCF-Transformer achieved the highest AUROCs, 0.798 for POU, 0.712 for 30-day readmission, 0.808 for CP, and 0.922 for OAO, outperforming Random Forest, FT-Transformer, and ResNet-based models. 
Compared to these baselines, HCF-Transformer generated more stable and discriminative risk estimates and demonstrated significant variation in TRR scores across discharge opioid options (ANOVA p < .01, eta-squared > .01). This enabled consistent identification of lower-risk regimens tailored to patient-specific profiles. Conclusions The HCF-Transformer introduces a novel hierarchical fusion approach to optimize opioid prescribing by integrating static patient risk profiles with modifiable discharge options. Using transformer-based modeling and a quantifiable TRR metric, the model delivers personalized, risk-aware recommendations. This approach enables data-driven opioid prescribing tailored to individual risk and has the potential to improve postoperative outcomes in high-risk populations. Our findings demonstrate that integrating modifiable factors with structured risk profiles through a transformer-based fusion architecture can enhance decision-support systems, paving the way for more actionable and personalized AI in healthcare.
Kale, S.; Singh, D.; Truumees, E.; Geck, M.; Stokes, J.
High inpatient opioid exposure is associated with increased risk of persistent opioid use. Early identification of high-risk patients may improve opioid stewardship. We developed machine learning models to predict high opioid exposure during hospitalization using electronic health record data from MIMIC-IV. We conducted a retrospective study of 223,452 unique first hospital admissions in MIMIC-IV. The outcome was high opioid exposure, defined as the top decile among opioid-exposed admissions (MME/day ≥ 225), representing 2.65% of all admissions. Structured early-admission features included demographics, admission characteristics, laboratory utilization and abnormality summaries, and 24-hour procedural indicators. Discharge-note data were incorporated using ClinicalBERT embeddings and interpretable bigram features. Models were trained using an 80/10/10 split and evaluated with temporal validation on the most recent 10% of admissions. Performance was assessed using ROC-AUC and PR-AUC with 95% confidence intervals. Among structured-only models, XGBoost achieved the best test performance (ROC-AUC 0.932 [0.924-0.940]; PR-AUC 0.223 [0.193-0.262]). The combined structured and notes model improved precision-recall performance (ROC-AUC 0.932 [0.920-0.943]; PR-AUC 0.276 [0.229-0.331]). Temporal evaluation showed similar discrimination (ROC-AUC 0.929; PR-AUC 0.223). High-risk bigrams included procedural terms such as "external fixation" and "cervical discectomy." Integration of structured and text-derived features improved risk stratification compared to structured data alone. Interpretable bigram signals reflected procedural complexity and orthopedic pathology, reinforcing the clinical plausibility of model predictions. Multimodal EHR-based models accurately predict high inpatient opioid exposure and may support targeted opioid stewardship during hospitalization.
Alsammani, A.; Johnson, M.; Elrefaei, J.
Objective: To develop, calibrate, and interpret machine learning models for predicting in-hospital mortality among intensive care unit (ICU) patients using clinical data collected during the first 24 hours of admission. Methods: We analyzed 53,866 adult ICU admissions from the MIMIC-IV (v2.2) database, including 5,787 in-hospital deaths (10.7%). An enhanced feature-engineering pipeline generated 88 laboratory-based features that captured distributional characteristics, temporal trends, and measurement frequency. Five machine learning classifiers were evaluated: L2-regularized logistic regression, random forest, XGBoost, LightGBM, and a calibrated soft-voting ensemble. Models were developed using a stratified 64:8:8:20 split for training, validation and hyperparameter tuning, calibration, and testing. Performance was assessed on a held-out test set (n = 10,774) using the area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), Brier score, calibration analysis, decision curve analysis (DCA), and SHAP-based model interpretation. Results: The calibrated ensemble achieved the best overall performance, with an AUROC of 0.856 (95% CI: 0.846-0.867), an AUPRC of 0.449 (95% CI: 0.418-0.480), and a Brier score of 0.078. XGBoost (AUROC 0.856; AUPRC 0.435) and LightGBM (AUROC 0.854; AUPRC 0.436) demonstrated performance comparable to the ensemble and significantly outperformed logistic regression (AUROC 0.823; AUPRC 0.376), yielding absolute AUROC improvements of approximately 0.031-0.033 (p < 0.001). Calibration substantially improved probabilistic predictions, reducing Brier scores by 42% for XGBoost (0.134 to 0.078) and 50% for LightGBM (0.151 to 0.076). Decision curve analysis demonstrated consistent net clinical benefit across the 5%-20% risk-threshold range. Key predictors included age, blood urea nitrogen, ICU subtype, measurement frequency, and lactate-related features. 
Model performance remained robust across ICU subtypes, with AUROC values exceeding 0.79. Conclusion: A calibrated and interpretable machine learning framework based on early ICU clinical data provides accurate and clinically actionable mortality risk estimates. By integrating trajectory-aware feature engineering, probabilistic calibration, and decision-analytic evaluation, this approach advances ICU mortality prediction toward more reliable and trustworthy clinical decision support systems.
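As a reading aid for the Brier scores quoted in this abstract, here is a minimal sketch of the metric and of the relative reductions it reports. The helper names are assumptions made for illustration and this is not the authors' pipeline.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def relative_reduction(before, after):
    """Fractional improvement of a lower-is-better score."""
    return (before - after) / before

# Reported Brier scores before and after calibration.
xgb_gain = relative_reduction(0.134, 0.078)   # about 0.42
lgbm_gain = relative_reduction(0.151, 0.076)  # about 0.50

# Context computed here, not reported in the abstract: a constant prediction
# equal to the cohort death rate (5,787 of 53,866) scores p * (1 - p).
prevalence = 5787 / 53866
no_skill_brier = prevalence * (1 - prevalence)  # about 0.096
```

Read this way, the uncalibrated boosted models (0.134 and 0.151) scored worse than the constant-rate baseline of about 0.096, and calibration brought them below it.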
Corga Da Silva, R.; Romano, M.; Mendes, T.; Isidoro, M.; Ravichandran, S.; Kumar, S.; van der Heijden, M.; Fail, O.; Gnanapragasam, V. E.
Background: Clinical documentation and information retrieval consume over half of physicians' working hours, contributing to cognitive overload and burnout. While artificial intelligence offers a potential solution, concerns over hallucinations and source reliability have limited adoption at the point of care. Objective: To evaluate clinician-reported time savings, decision-making support, and satisfaction with DR. INFO, an agentic AI clinical assistant, in routine clinical practice. Methods: In this prospective, single-arm pilot study, 29 clinicians across multiple specialties in Portuguese healthcare institutions used DR. INFO v1.0 over five working days within a two-week period. Outcomes were assessed via daily Likert-scale evaluations and a final Net Promoter Score (NPS). Non-parametric methods were used throughout. Results: Clinicians reported high perceived time savings (mean 4.27/5; 95% CI: 3.97-4.57) and decision support (4.16/5; 95% CI: 3.86-4.45), with ratings stable across all study days and no evidence of attrition bias. The NPS was 81.2, with no detractors. Conclusions: Clinicians across specialties and career stages reported sustained satisfaction with DR. INFO for both time efficiency and clinical decision support. Validation in larger, controlled studies with objective outcome measures is warranted. Keywords: Medical AI assistant, LLMs in healthcare, Agentic AI, Clinical decision support, Point of care AI
Collier, A.
Background: Nursing documentation patterns may reflect patient acuity and clinical deterioration, yet their prognostic value remains underexplored. We developed the Intensive Documentation Index (IDI), a novel framework quantifying temporal documentation rhythms, and evaluated its ability to enhance ICU mortality prediction. Methods: We analyzed 26,153 ICU admissions of heart failure patients from the MIMIC-IV database (2008-2019). Nine IDI features capturing documentation rhythm, volume, and surveillance gaps were extracted from electronic health record timestamps during the first 24 hours of ICU stay. We compared logistic regression models with and without IDI features using temporal validation and race-stratified analysis. Results: The cohort had a mean age of 68.5 ± 13.2 years and an in-hospital mortality rate of 15.99% (n=4,181). The baseline model (age, sex, ICU length of stay) achieved an AUC of 0.658 (95% CI 0.609-0.710). Addition of nine IDI features significantly improved discrimination to 0.683 (95% CI 0.631-0.732), an absolute increase of 0.025 (p<0.05, DeLong test). Leave-one-year-out cross-validation across 12 years yielded a mean AUC of 0.684 (SD 0.008). The coefficient of variation of inter-event intervals (idi_cv_interevent) was the strongest predictor (OR 1.53 per SD, 95% CI 1.38-1.70, p<0.001). Model performance was consistent across racial and ethnic groups (AUC range 0.673-0.691), with no evidence of systematic bias. Conclusions: Documentation rhythm patterns, captured through the IDI framework, significantly enhance ICU mortality prediction beyond traditional clinical variables. The association between documentation irregularity and mortality may reflect nursing workload, patient acuity, or care processes warranting further investigation. IDI represents a novel, readily available prognostic signal that could inform future clinical decision support systems.
Monfared, V.
Osteoporosis is often called a silent disease because it progresses without symptoms until a fracture occurs, posing a serious, yet frequently overlooked, threat to women's health. In response to the pressing need for early detection, we introduce OsteoInsight, an intelligent, AI-powered web application designed to assess osteoporosis risk with both clinical accuracy and interpretability. Built on a Random Forest classifier trained on over 2000 women's health records, our model incorporates a wide range of domain-informed features, including hormonal history, lifestyle factors, reproductive health, and conditions affecting bone health. Despite an imbalanced dataset, with around 75% of cases being osteoporosis-positive, the model achieved encouraging results: 71.81% accuracy, an F1-score of 0.79, and an AUC-ROC of 0.78. SHAP analysis highlighted age, BMI, and menstrual history as key predictors, offering transparent insights into the model's reasoning. Additional contributors such as fracture history, signs of low estrogen, and lactation duration were also found to be significant, enriching the interpretability of predictions. These insights are integrated into the OsteoInsight user interface, making risk assessments not only accessible but also understandable for both clinicians and users. Our findings underscore the potential of AI-driven tools to enhance early screening and enable personalized risk profiling, empowering women and healthcare providers to take proactive steps in osteoporosis prevention.
Mangut, E.; Wallace, R.
Background: Professionalism and effective communication are foundational determinants of patient safety and quality of care. Unprofessional behaviors frequently serve as active precursors to adverse clinical events. However, proactive organizational surveillance is often hindered because incident feedback exists primarily as unstructured, free-text data. This study aimed to develop and validate a Natural Language Processing (NLP) pipeline and interactive dashboard to proactively monitor the "professionalism climate" within NYC Health + Hospitals, the largest municipal healthcare delivery system in the United States. Methods: A high-fidelity synthetic dataset (N=400) was computationally generated to safely mirror historical incident logs across 11 acute facilities without utilizing Protected Health Information (PHI). A rule-based NLP pipeline was developed in R utilizing the tidytext package. Unstructured narrative feedback was tokenized and classified into three core domains: Respect, Safety, and Communication. To validate the pipeline's accuracy, a 25% random stratified sample (n=100) was evaluated against independent, blinded manual coding performed by two reviewers, with inter-rater reliability measured via Cohen's Kappa. Finally, an interactive Tableau dashboard was developed to operationalize and visualize these metrics for ongoing surveillance. Results: The NLP algorithm achieved an overall accuracy of 85.8% (95% CI: 79.0-92.6), with 81.2% sensitivity and 88.9% specificity. The highest domain-specific performance was observed in Communication (88.0% accuracy). Manual validation demonstrated strong inter-rater reliability (κ=0.84). Operational analysis via the dashboard revealed that 61.8% of reports occurred during the Tour 2 shift (15:00 to 23:00), aligning with peak operational volume. 
Furthermore, Respect-related feedback was reported at a disproportionately high frequency during the Tour 3 shift (23:00 to 07:00), accounting for over 50.7% of overnight feedback submissions. Conclusion: Rule-based NLP successfully transforms qualitative healthcare feedback into structured, actionable intelligence with high specificity. Integrating this pipeline into operational dashboards transitions safety culture surveillance from a reactive, manual exercise to a proactive, scalable system, enabling targeted, data-driven interventions by hospital leadership.
Ahammed, F.
Fraud in healthcare is a worsening problem with far-reaching consequences: it burdens the financial stability of the health industry and threatens the quality of medical care. It arises from vulnerabilities in the current healthcare framework that fraudsters exploit. Although many models have been developed to detect fraudulent patterns in insurance claims, their accuracy frequently suffers from class imbalance in the Medicare dataset and from irrelevant features. This study aims to improve detection performance and accuracy by combining a deep learning model with data sampling and feature selection techniques. Different combinations are compared to determine how effectively each improves the accuracy of the fraud detection model. The results show that combining data sampling with feature selection improves accuracy and performance. Using both Chi-square and Synthetic Minority Over-sampling Technique (SMOTE) methods, the model reached 95.4% accuracy with negligible evidence of overfitting. The findings underscore the value of combined techniques over a baseline deep learning model alone for detecting Medicare insurance fraud.
Zhang, Y.; Trinh, S. H.; Phelan, T.; Byrd, T. F.; Tourani, R.; Kumar, V.; Caraballo, P. J.; Melton, G. B.; Simon, G. J.
Background: Sepsis is a life-threatening condition in which delayed recognition and treatment are associated with increased mortality. While predictive models such as Epic's Early Detection of Sepsis Model (ESM) were developed to support early intervention, their real-world impact after integration into clinical workflows remains difficult to evaluate. Objectives: To evaluate the real-world impact of ESM integrated into clinical workflow on clinical outcomes, antibiotic use, and harm-benefit tradeoffs. Methods: We conducted a quasi-experimental study in a single healthcare system using encounter-level data from inpatient settings. Inpatient mortality, prolonged hospitalization, antibiotic use, and sepsis prevalence were compared between the pre-implementation period (3 June 2023 to 20 August 2024) and the online period (21 August 2024 to 26 December 2024) when the model became visible to clinicians. We also applied a counterfactual framework using models trained on pre-implementation data to estimate expected outcomes without ESM and to quantify harms related to overtreatment and delayed treatment. Results: Among 101,138 encounters, 86,884 occurred during the pre-implementation period and 14,254 during the online period. In unadjusted analyses, the online period had lower inpatient mortality, prolonged hospitalization, antibiotic use, and sepsis prevalence (all p≤0.002). In the counterfactual analyses, observed outcomes were lower than expected without ESM for mortality (1.21% vs 1.82%; p<0.001), prolonged hospitalization (5.56% vs 7.95%; p<0.001), and antibiotic use (43.52% vs 47.04%; p<0.001). False positive harm (37.72% vs 41.68%; p<0.001) was also lower than expected. 
Conclusions: Integration of ESM into clinical workflow was associated with improved patient outcomes, reduced antibiotic use, and decreased harm from overtreatment, without evidence of increased harm from delayed treatment, supporting a positive net clinical benefit and the safety and effectiveness of ESM under Software as a Medical Device principles. Keywords: Machine learning, Electronic health records, Clinical workflow, Counterfactual analysis, Real-world evaluation
Jarrett, P.; Reeder, J.; McDonald, S.; Diercks, D.; Jamieson, A. R.
Objective: To evaluate a ranking approach for emergency department (ED) waiting room prioritization that uses pairwise clinical comparisons aggregated via a Bradley-Terry model, and to assess its cross-site stability without site-specific training. Materials and Methods: Using the Multimodal Clinical Monitoring in the Emergency Department (MC-MED) dataset (118,385 ED visits, Site A), we defined a composite deterioration outcome (intensive care unit [ICU] admission, intubation, vasopressor, ventilation, or death within 6 hours) and evaluated 7 queue-ordering policies across 1,000 simulated shifts. The primary endpoint was Recall@5 for deteriorators; secondary endpoints included area under the receiver operating characteristic curve (AUROC) and simulated time-to-provider (TTP) metrics. External validation used MIMIC-IV-ED (425,087 visits, Site B) with 500 shifts. Methods reported per TRIPOD-LLM. Results: On MC-MED, BT-LLM-Enriched (Bradley-Terry ranking with a large language model [LLM] judge, GPT-4.1, using full diagnoses and medications) exceeded the Emergency Severity Index (ESI) on the primary endpoint: Recall@5 0.587 vs. 0.491 (p<0.001). XGBoost achieved Recall@5 0.648 but required large site-specific labeled training data. On external validation, supervised model performance attenuated (XGBoost AUROC 0.892 to 0.807) while BT-LLM-Enriched remained stable (0.826 to 0.831); the two were statistically indistinguishable on external data. Discussion: Under external validation, supervised model performance attenuated while zero-shot LLM ranking remained stable, suggesting cross-site stability without requiring site-specific training data. Conclusion: Pairwise ranking with an LLM judge significantly outperforms ESI-based ordering and remains stable across sites without local training, matching supervised models on external data.
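For readers unfamiliar with the aggregation step, the following is a minimal sketch of fitting Bradley-Terry strengths from pairwise outcomes with the standard minorization-maximization update. It assumes the comparisons arrive as (winner, loser) pairs, where "winner" means the patient judged more urgent. The function name and input format are hypothetical, and the abstract does not describe how the authors fitted their model.

```python
from collections import defaultdict

def fit_bradley_terry(comparisons, n_iter=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict mapping each item to a strength; strengths sum to 1.
    Uses the MM update: strength_i = wins_i / sum_j n_ij / (s_i + s_j).
    """
    items = sorted({x for pair in comparisons for x in pair})
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {i: 1.0 / len(items) for i in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if j != i and n_ij > 0:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {i: v / total for i, v in updated.items()}
    return strength

# Hypothetical comparisons among three waiting patients.
pairs = ([("A", "B")] * 3 + [("B", "A")] +
         [("B", "C")] * 3 + [("C", "B")] +
         [("A", "C")] * 3 + [("C", "A")])
queue = sorted(fit_bradley_terry(pairs), key=fit_bradley_terry(pairs).get, reverse=True)
```

Sorting by fitted strength yields the queue order. A patient who never wins a comparison converges to strength zero, which still ranks them last but means the maximum-likelihood estimate is degenerate, so a real system would likely add regularization.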
Kim, Y. W.; Lau, W.; Patel, N.; Kendrick, K.; Wu, A.; Feldman, T.; Ahern, R.; Oka, A.
Background: The Kansas City Cardiomyopathy Questionnaire (KCCQ) is a validated patient-reported outcome measure for heart failure. However, its clinical utility is limited by incomplete and inconsistent data collection. We aimed to develop and validate machine learning models to estimate KCCQ overall summary scores from electronic health record (EHR) data. Methods: We assembled a retrospective cohort of 10,889 heart failure patients with recorded KCCQ scores from the Truveta database. Predictor features were derived from structured EHR variables across 13 historical time windows (15-360 days). Multiple regression algorithms were evaluated, followed by SHapley Additive exPlanations (SHAP)-based feature reduction and nested cross-validation for hyperparameter optimization. Model performance was assessed using the coefficient of determination (R²), mean absolute error (MAE), and ordinal discrimination and calibration for categorical severity classification. Results: Histogram-based gradient boosting (HGB) with HGB-SHAP feature selection achieved the strongest performance, reducing feature dimensionality by more than 94% while maintaining estimation accuracy. The 240-day window performed best (R²=0.522, MAE=12.485). For categorical severity classification, the model demonstrated strong ordinal discrimination (mean ordinal AUROC=0.850). Quantile-based calibration improved classification balance, increasing the F1-score for the most severe category (KCCQ<25) from 0.180 to 0.428 and the quadratic weighted kappa from 0.601 to 0.640. Longer EHR observation windows were associated with improved prediction performance. Conclusion: Machine learning models can estimate KCCQ scores from routine EHR data with clinically meaningful accuracy and strong discriminatory performance. 
This approach may help extend assessment of patient-reported health status to populations in which survey-based data are incompletely captured, supporting population-level cardiovascular outcomes assessment and risk stratification in heart failure care.
Collier, A.
Background Electronic health record documentation patterns may reflect workflow complexity, monitoring intensity, and operational strain in intensive care settings. However, documentation-derived features can be sensitive to local documentation culture, data capture systems, and outcome definitions. Retrospective validation across multiple datasets is therefore needed before these signals are used in workflow intelligence or clinical AI governance tools. Objective To evaluate whether documentation-density and documentation-timing features show reproducible retrospective signal for ICU workflow complexity and long-stay proxy outcomes across de-identified critical care datasets, while distinguishing workflow and long-stay associations from unsupported claims about mortality prediction, burden reduction, or deployment readiness. Methods We synthesized retrospective validation results from de-identified ICU and workflow datasets generated through a prespecified documentation-density validation program. Feature families included Documentation Burden Score style features, Shift-End Documentation Rate style features, documentation reliability style metadata, and all-documentation feature sets where available. Outcomes included long ICU length of stay proxies, mortality where available, and workflow proxy endpoints. Models compared baseline feature sets with enhanced models containing documentation-density or workflow features. Performance was summarized using area under the receiver operating characteristic curve, Brier score where reported, delta AUROC, bootstrap confidence intervals where reported, and label-shuffle controls where available. Results The strongest external long-stay proxy evidence came from the NWICU chartevents analysis, which included 28,612 ICU stays, 20,267 stays with chart events, and 9,619,759 chart events. For ICU length of stay greater than the median, baseline AUROC was 0.5252. 
Enhanced AUROC was 0.9512 for Documentation Burden Score features, 0.9214 for Shift-End Documentation Rate features, 0.8470 for documentation reliability style features, and 0.9517 for all documentation features. Corresponding label-shuffle enhanced AUROCs were near random, ranging from 0.4897 to 0.5064. For ICU length of stay greater than the 75th percentile, baseline AUROC was 0.5155. Enhanced AUROC was 0.9433 for Documentation Burden Score features, 0.9194 for Shift-End Documentation Rate features, 0.8118 for documentation reliability style features, and 0.9427 for all documentation features, with label-shuffle enhanced AUROCs from 0.4836 to 0.4999. Additional retrospective support was observed in eICU workflow analyses, HiRID first-24-hour documentation-density analyses, MIMIC-IV HF ICU internal analyses, MIMIC-IV-Note metadata extensions, and nursing-chart or lab density proxy analyses. However, cross-institution discrimination transfer was weak without recalibration, and several analyses remained proxy validations rather than final clinical validations. Conclusions Documentation-density and documentation-timing features show promising retrospective signal for ICU workflow complexity and long-stay proxy outcomes, especially in NWICU chartevents and selected internal dataset-specific analyses. These findings support further preregistered, prospective, silent-mode validation of documentation-derived workflow intelligence. They do not establish prospective clinical performance, mortality reduction, clinician burden reduction, autonomous deterioration prediction, or deployment readiness.
Dickens, A. R.
Background: Automated sepsis early-warning systems have attracted substantial research investment, yet a fundamental question remains unresolved: do these models detect independent biological signals, or do they predominantly learn care-process intensity, that is, the pattern of clinician ordering behavior applied to patients already suspected of being ill? We report a pre-registered falsification study testing this hypothesis across four independent clinical datasets. Methods: A four-phase falsification framework with pre-specified thresholds was registered on OSF (March 11, 2026) before any data access. The primary confirmatory analysis used MIMIC-IV v3.1 (n=65,241 adult ICU stays, Beth Israel Deaconess Medical Center, 2008-2022). Exploratory replication analyses used eICU-CRD v2.0 (n=136,864, 208 US hospitals), MIMIC-III v1.4 (n=44,091), and the PhysioNet/CinC 2019 Sepsis Challenge (n=40,314). Each phase tested a distinct falsification criterion: (1) concordance across Sepsis-2, Sepsis-3, and CMS SEP-1 definitions; (2) model performance degradation when care-intensity proxy features are removed; (3) predictive performance of care-intensity features alone; and (4) discriminability of synthetic records generated to match care-intensity distributions. Results: The pre-registered primary analysis (MIMIC-IV) did not confirm the hypothesis (0/4 phases confirmed). Biological features predicted Sepsis-3 labels with AUROC 0.901 (95% CI 0.899-0.904); removing care-intensity features reduced performance by only 0.003 AUROC (drop=0.0027). The pre-specified Phase 3 threshold (care-only AUROC >0.70) was not met by the primary logistic regression model (AUROC 0.660); however, a sensitivity XGBoost model did exceed the threshold (AUROC 0.729), suggesting nonlinear care-intensity signal. 
However, a clinically significant finding emerged consistently across all four datasets: mean pairwise Jaccard similarity between clinical sepsis definitions and administrative coding (CMS SEP-1) was approximately 0.32 at the primary site and 0.20 across multi-center cohorts, indicating that hospital quality metrics and regulatory reporting systematically measure a different patient population than clinical definitions identify. Exploratory analyses revealed a detectable care-intensity signal in the eICU multi-center cohort (AUC drop=0.076) not present at the single academic center. Conclusions: At an elite academic medical center, sepsis prediction models detect genuine biological signal. Care-process leakage is not the primary driver of model performance in MIMIC-IV. The more consequential and robust finding is the systematic divergence between clinical and administrative sepsis definitions across all datasets examined, which has direct implications for regulatory reporting, pay-for-performance metrics, and the validity of AI benchmarks built on administrative data.
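To make the concordance metric in the abstract above concrete, here is a minimal sketch of mean pairwise Jaccard similarity between the sets of stays flagged by each sepsis definition. The abstract does not state exactly which pairs were averaged; this sketch assumes all unordered pairs, and the stay IDs and dictionary keys are hypothetical toy values, not study data.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets (defined as 1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def mean_pairwise_jaccard(cohorts):
    """Average Jaccard similarity over all unordered pairs of cohorts (assumed pairing)."""
    pairs = list(combinations(cohorts.values(), 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Toy stay IDs flagged by each definition (hypothetical, for illustration only).
flagged = {
    "sepsis2": {1, 2, 3, 4},
    "sepsis3": {2, 3, 4, 5},
    "sep1": {4, 5, 6},
}
score = mean_pairwise_jaccard(flagged)
```

A value near 1 would mean the definitions flag nearly the same stays; the values of about 0.32 and 0.20 reported in the abstract indicate that most flagged stays are not shared between definitions.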
Shah, K. P.; Airan Javia, S.; Savage, T.; Bressman, E.
End-of-rotation handoffs are critical for patient safety but add to documentation burden for hospitalists. Generative artificial intelligence (AI) may help automate handoff creation using electronic health record data, but its impact on quality and safety is unclear. Methods: We developed an AI handoff tool with a large language model using clinical notes as input and conducted a retrospective evaluation comparing AI-generated and clinician-authored handoffs. Handoffs were assessed across domains of quality and safety through a structured review. Results: Quality ratings were similar between AI and human handoffs (3.7 vs. 3.5, p=0.57). AI-generated handoffs were rated higher for organization (4.4 vs. 4.1, p=0.05) and completeness (4.1 vs. 3.6, p=0.01), but lower for conciseness (3.7 vs. 4.1, p=0.03) and accuracy (4.1 vs. 4.4, p=0.03). Error rates were comparable (0.3/handoff in both groups); however, AI-generated handoffs included inaccuracies (9% of AI errors) and hallucinations (1% of AI errors), while clinician-authored handoffs contained only omissions. Conclusion: Human and AI handoffs have differing error profiles and tradeoffs between completeness and conciseness. Prospective evaluation in clinical workflows is underway.
Musholt, T. J.; Clerici, T.; Bergenfelz, A.; Schmidt, C. O.; Struckmann, S.
Background: Medical registries have gained importance in the evaluation of healthcare quality outcomes. In the absence of high-quality evidence, such as randomized controlled trials, studies based on registry data are essential for informing clinical guidelines. Methods for assessing data quality are rarely described in detail. To ensure the credibility of registry-based studies, registries must use all available technical and operational means to guarantee high data quality. Method: Eurocrine® is a pan-European endocrine surgical database and quality registry initially funded by the EU healthcare programme. It started in 2015 and included more than 200,000 interventions as of April 2025. To ensure high data quality, interactive and standardized reports are created in Microsoft Power BI, both centrally and locally. In addition, comprehensive data quality analyses were performed with the R package dataquieR. Results: Although a multitude of technical measures (for example, input screen design and real-time plausibility checks during data entry) are in place, they are not sufficient to prevent human errors at data entry. Errors identified in the reports were corrected, and preventive measures were implemented. Overall, the data quality was assessed as very good in terms of completeness, accuracy, and consistency. Conclusion: It is very important to provide registry users with an efficient and smart tool to identify data issues, as they have the clinical information to correct them. Data quality reports generated with dataquieR represent an effective tool for registry administrators. Predesigned Microsoft Power BI reports enable participating Eurocrine® clinics to self-audit their data.
Benning, L.; Hirsch, A.; Groeschel, M.; Roeschl, T.; Spott, M.; Hans, F. P.; Urban, T.; Busch, H.-J.; Meyer, A.; Madrid, J.
Background Emergency department (ED) triage is a high-stakes clinical decision process that determines patient prioritization and resource allocation under time pressure. Large language models (LLMs) have recently been proposed as decision-support tools for triage, yet most evaluations rely on simulated scenarios or curated datasets. Evidence from real-world clinical environments remains limited. The objective of this project was to systematically evaluate the performance, calibration, and reproducibility of multiple contemporary large language models for Emergency Severity Index (ESI) classification and sectoral allocation (ED vs. urgent care practice, UCP) using a comprehensive real-world triage dataset. Material and Methods Retrospective cross-sectional benchmarking study conducted at a tertiary academic ED in Germany with an integrated central point of assessment (CPA). The study included all consecutive adult walk-in encounters (>18 years) presenting between October 2023 and February 2024 (N = 16,107). Data were collected from a structured clinical decision support system capturing presenting complaints, vital signs, and triage decisions recorded by specialized nursing staff. The dataset comprised structured clinical variables routinely collected at triage, including presenting complaint categories (CEDIS-PCL), vital signs according to the ABCDE framework, and additional structured or free-text clinical information. Results The primary outcome was the agreement between LLM-predicted and nurse-assigned ESI levels measured using quadratic-weighted Cohen's κ. Secondary outcomes included sectoral assignment agreement, misclassification patterns (over- and under-triage), calibration metrics, and output reproducibility. Quadratic-weighted κ values ranged from 0.18 to 0.75 across models. Only a structured stepwise prompting strategy achieved substantial agreement (κ_qw = 0.747), approaching reported human inter-rater reliability. 
Most models demonstrated moderate or lower agreement and systematic overconfidence, with expected calibration errors (ECE) based on verbalized confidence ranging from 0.099 to 0.355. Sectoral assignment agreement (i.e., ED vs. UCP) was uniformly low (κ < 0.30). Reproducibility testing revealed substantial variability in 23% of cases, indicating non-deterministic output behavior for clinically relevant decisions. Conclusions Current large language models demonstrate heterogeneous and generally limited performance in real-world emergency triage tasks. Structured algorithm-guided prompting appears more influential than model architecture or size. Before clinical implementation, improvements in calibration, reliability, and workflow integration are required, alongside regulatory-compliant validation in prospective clinical settings.
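For readers unfamiliar with the primary outcome in the abstract above, the following is a minimal sketch of quadratic-weighted Cohen's kappa for ordinal labels such as ESI levels 1 to 5. It uses the standard disagreement weights (i − j)² / (k − 1)²; the function name and the toy ratings are hypothetical and are not taken from the study.

```python
def quadratic_weighted_kappa(rater_a, rater_b, categories):
    """Cohen's kappa with quadratic disagreement weights for ordered categories."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed confusion counts between the two raters.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = ((i - j) ** 2) / ((k - 1) ** 2)
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters' marginals.
            weighted_expected += weight * row[i] * col[j] / n
    return 1.0 - weighted_observed / weighted_expected

# Toy triage levels (hypothetical): nurse-assigned vs. model-predicted ESI.
nurse = [1, 2, 3, 4, 5]
model = [2, 2, 3, 4, 4]
kappa = quadratic_weighted_kappa(nurse, model, categories=[1, 2, 3, 4, 5])
```

Because the weights grow with the squared distance between levels, a prediction that is off by one ESI level is penalized far less than one that is off by three, which is why this variant suits ordinal triage scales.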
Miran, S. A.; Cheng, Y.; Faselis, C.; Brandt, C.; Vasaitis, S.; Nesbitt, L.; Zanin, L.; Tekle, S.; Ahmed, A.; Nelson, S. J.; Zeng-Treitler, Q.
Objectives: To develop and evaluate predictive models for unused outpatient appointments (missed or cancelled) using a large national electronic health record (EHR) repository in the United States. Design: Retrospective observational study using machine learning and statistical modeling. Setting: A U.S. national electronic health record repository (Cerner Real World Database) covering healthcare encounters from 2010 to 2025. Participants: Adult patients aged ≥18 years with routine outpatient encounters recorded in the database. One outpatient appointment with a known status was randomly selected per patient, resulting in a final analytic sample of 5,699,861 encounters. Primary and Secondary Outcome Measures: The primary outcome was whether the index outpatient appointment was attended or unused (missed or cancelled). Model performance was evaluated using area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Methods: Predictors included patient characteristics (demographics and insurance type), appointment characteristics (day, time, season, and urbanicity), prior cancellation rate, and time gap between the index appointment and the previous visit. We compared the predictive performance of two machine learning models (random forest classifier and extreme gradient boosting (XGBoost)) with logistic regression. An explainable AI analysis of feature impact was performed on the final XGBoost model. Results: Among 5,699,861 outpatient encounters, 3,650,715 (64.0%) were attended and 2,049,146 (36.0%) were unused. XGBoost achieved the best predictive performance on the test dataset (AUC = 0.95), followed by random forest (AUC = 0.92) and logistic regression (AUC = 0.89). Feature impact score analysis revealed highly non-linear associations between predictors and the risk of unused appointments at the individual level. Conclusions: Unused outpatient appointments can be accurately predicted using routinely available EHR data. 
Integrating predictive models into scheduling workflows may improve healthcare efficiency and optimize appointment management.
Article Summary: Strengths and limitations of this study
- This study used one of the largest national electronic health record datasets to develop predictive models for unused outpatient appointments.
- Multiple modeling approaches, including logistic regression and machine learning methods (random forest and XGBoost), were compared to evaluate predictive performance.
- An explainable artificial intelligence method was applied to quantify feature impact and improve model interpretability.
- The retrospective design and reliance on routinely collected EHR data may introduce data quality limitations and unmeasured confounding.
- The database did not distinguish clearly between cancelled appointments and no-shows.